ABSTRACT

In the past decade a number of fixed-sample-size methods have
been developed for selecting the "best", or at least a "good", subset
of variables in regression analysis. We derive a sequential selection
procedure that selects a subset of random size containing all good
regression equations. Tables for an example are given at the end of
this paper.


1.  INTRODUCTION 

In the past decade a number of fixed-sample-size methods have
been developed for selecting the "best", or at least a "good", subset
of variables in regression analysis (see, e.g., Arvesen and
McCabe (1975) and Spjøtvoll (1972)). In this paper, we derive a
sequential selection procedure that selects a subset of random size
containing all "good" regression equations.
Tables for an example are given at the end of this paper.

2.  SEQUENTIAL  SUBSET  SELECTION  PROCEDURE

Before discussing the regression problem, we develop general
results applicable to the selection of "good" or "superior"
populations, which are defined later.

Let $\pi_0, \pi_1, \ldots, \pi_k$ denote $k+1$ normal populations with unknown
means $\mu_0, \mu_1, \ldots, \mu_k$ and variances $\sigma_0^2, \sigma_1^2, \ldots, \sigma_k^2$. Assume that $\sigma_0^2$
is known but $\sigma_i^2$ $(1 \le i \le k)$ are unknown. Let the ranked
values of the $\sigma_i^2$ be denoted by $\sigma_{[1]}^2 \le \cdots \le \sigma_{[k]}^2$. We wish to
construct a sequential procedure that selects a subset containing
all "superior" populations, that is, the populations with smaller
variances, with probability not less than $P^*$ $(0 < P^* < 1)$, a
specified constant. We assume that $\sigma_0^2 = 1$.

Let $X_{in}$ denote the $n$th observation from population $\pi_i$. It is
assumed that the observations $X_{i1}, \ldots, X_{in}$ are independent random
variables. Define

$$\bar{X}_{in} = \frac{1}{n}\sum_{j=1}^{n} X_{ij}, \qquad s_{in}^2 = \frac{1}{n-1}\sum_{j=1}^{n}\left(X_{ij} - \bar{X}_{in}\right)^2.$$

The selection procedure will depend upon $\{s_{in}^2\}$, which is a sufficient
and transitive sequence and is also invariantly sufficient for $\sigma_i^2$.


A population $\pi_i$ is said to be "superior" (or "good") if
$\sigma_i^2 \le \Delta$, and to be "inferior" (or "bad") if $\sigma_i^2 > \Delta$, where $\Delta$ is a
specified constant greater than 1. Let $\Omega$ be the parameter space, which
is the collection of all possible parameter vectors $\theta =
(\sigma_1^2, \ldots, \sigma_k^2)$. Let $t$ denote the unknown number of inferior
populations in the given collection of $k$ populations. We have
$0 \le t \le k$. Let

$$\Omega_t = \{\theta \mid \sigma_{[1]}^2 \le \cdots \le \sigma_{[k-t]}^2 \le \Delta < \sigma_{[k-t+1]}^2 \le \cdots \le \sigma_{[k]}^2\}.$$

Then $\Omega = \bigcup_{t=0}^{k} \Omega_t$.

For the subset selection procedure $R$, two constants $\Delta$ and $P^*$
with $\Delta > 1$, $1 > P^* > 0$, are specified and we wish to select a
subset containing all superior populations with a probability of at
least $P^*$. When all the superior populations are contained in the
selected subset, we say a correct decision (CD) has been made.

Thus we require a procedure for which

$$P_\theta(CD \mid R) \ge P^*$$

for all $\theta \in \Omega$.

Let $g_{\sigma_i^2}(s_{in}^2)$ denote the probability density of $s_{in}^2$, which
depends on the parameter $\sigma_i^2$. We define the log-likelihood ratios

$$\ell_n(s_{in}^2) = \log g_{\Delta}(s_{in}^2) - \log g_{1}(s_{in}^2) \tag{2.1}$$

upon which the procedure is based.
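For normal populations the ratio (2.1) has a closed form, because $(n-1)s_{in}^2/\sigma_i^2$ is chi-square with $n-1$ degrees of freedom: $\ell_n(s_{in}^2) = -\frac{1}{2}(n-1)\log\Delta + \frac{1}{2}(n-1)s_{in}^2(1 - 1/\Delta)$. The Python sketch below is an illustration only and is not taken from the paper; the function names `log_lr_variance` and `log_density_s2` are hypothetical, and the second function exists solely to check the closed form against the chi-square log-density.

```python
import math
import statistics


def log_lr_variance(sample, delta):
    """Closed form of l_n(s^2) = log g_Delta(s^2) - log g_1(s^2).

    Assumes a normal sample; `delta` plays the role of the constant
    Delta > 1 in the text.
    """
    nu = len(sample) - 1                     # degrees of freedom n - 1
    s2 = statistics.variance(sample)         # sample variance, divisor n - 1
    return -0.5 * nu * math.log(delta) + 0.5 * nu * s2 * (1.0 - 1.0 / delta)


def log_density_s2(s2, nu, sigma2):
    """Log density of s^2 when nu * s^2 / sigma2 is chi-square with nu d.f."""
    x = nu * s2 / sigma2
    log_chi2 = ((nu / 2 - 1) * math.log(x) - x / 2
                - (nu / 2) * math.log(2) - math.lgamma(nu / 2))
    # Change of variables from x to s^2 contributes log(nu / sigma2).
    return math.log(nu / sigma2) + log_chi2


sample = [0.3, -1.2, 2.1, 0.4, -0.8]         # made-up illustrative data
delta = 1.5
closed_form = log_lr_variance(sample, delta)
s2 = statistics.variance(sample)
direct = log_density_s2(s2, 4, delta) - log_density_s2(s2, 4, 1.0)
```

Both routes give the same value up to rounding, which is a quick way to confirm the sign conventions in (2.1).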

Elimination type sequential selection procedure R for selecting
the superior populations.

Begin by taking $n_1$ $(\ge 1)$ independent observations from each
of the $k$ populations. Calculate the values of the $k$ log-likelihood
ratios $\ell_{n_1}(s_{in_1}^2)$, $1 \le i \le k$. For any $i$, if

$$\ell_{n_1}(s_{in_1}^2) \ge a,$$

where $a = \log\{k(k+1)/[2(1-P^*)]\}$, we eliminate the population $\pi_i$ from
further consideration. We proceed to the next (second) stage by
taking $n_2 - n_1$ independent observations on each of the remaining
populations. The log-likelihood ratios for the contending populations
are again computed and the same elimination rule is used,
except that $\ell_{n_2}(s_{in_2}^2)$ everywhere replaces $\ell_{n_1}(s_{in_1}^2)$. We continue
in this manner until the elimination is stopped, at which time the
procedure is terminated with the declaration that the remaining
populations are the superior populations. If after applying this
rule at the $s$th stage (say), the number of remaining populations
is zero, then we select the population $\pi_0$, which is the control
population.
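The elimination rule just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the callback `llr(i, stage)` is a hypothetical stand-in for the log-likelihood ratio of population $i$ at a given stage, and `max_stages` is an assumed cap on the number of stages.

```python
import math


def threshold_a(k, p_star):
    """a = log( k(k+1) / (2(1 - P*)) ), the elimination threshold."""
    return math.log(k * (k + 1) / (2.0 * (1.0 - p_star)))


def run_procedure(llr, k, p_star, max_stages):
    """Run the elimination rule; return (selected populations, rejections per stage).

    `llr(i, stage)` is a hypothetical callback returning the log-likelihood
    ratio of population i (1..k) at the given stage (1, 2, ...).
    """
    a = threshold_a(k, p_star)
    remaining = list(range(1, k + 1))
    history = []
    for stage in range(1, max_stages + 1):
        rejected = [i for i in remaining if llr(i, stage) >= a]
        history.append(rejected)
        remaining = [i for i in remaining if i not in rejected]
        # Stop when a stage rejects nothing, or when nothing is left.
        if not rejected or not remaining:
            break
    # If every population was eliminated, select the control population 0.
    return (remaining if remaining else [0]), history


# Made-up log-likelihood ratios for k = 3 populations over three stages.
fake = {1: {1: 5.0, 2: 1.0, 3: 0.0}, 2: {2: 4.5, 3: 0.1}, 3: {3: 0.2}}
selected, history = run_procedure(lambda i, s: fake[s][i], 3, 0.9, 3)
```

With these made-up values the threshold is $\log 60$, population 1 is rejected at stage 1, population 2 at stage 2, and stage 3 rejects nothing, so the procedure stops with population 3 selected.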

Note that $n_1$ is the sample size of that stage of the procedure
at which a decision may be made, for the first time, to
reject one or more populations. Let $n_2 > n_1$ be the sample size of
the next stage of the procedure at which such a decision may be
made, and in general let $n_s > n_{s-1}$ be the sample size of the stage
of the procedure at which the $s$th decision to reject one or more
populations may be made. Let $N$ be the stage at which the procedure
terminates. It is clear that if there are $k$ populations to
start with, then $N \le n_k$ (see Gupta and Huang (1975)).

We assume that

$$P_{\sigma_i^2}\{\ell_n(s_{in}^2) \ge a \text{ for some } n\} \tag{2.2}$$

is a nondecreasing function of $\sigma_i^2$. A sufficient condition for
this is discussed by Hoel (1970). Without loss of generality, we
assume that $\pi_1, \ldots, \pi_{k-t}$ are the superior populations. Since the
procedure $R$ is truncated, we have

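$$\begin{aligned}
1 - P(CD \mid R) &= P\{\ell_n(s_{in}^2) \ge a \text{ for some } i = 1, \ldots, k-t, \text{ for some } t,\ 0 \le t \le k, \text{ for some } n\} \\
&\le \sum_{t=0}^{k} \sum_{i=1}^{k-t} P_{\sigma_i^2}\{\ell_n(s_{in}^2) \ge a \text{ for some } n\} \\
&\le \sum_{t=0}^{k} \sum_{i=1}^{k-t} P_{\Delta}\{\ell_n(s_{in}^2) \ge a \text{ for some } n\} \\
&\le \sum_{t=0}^{k} (k-t)\, e^{-a} = \frac{k(k+1)}{2}\, e^{-a} \\
&= 1 - P^*.
\end{aligned}$$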
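The arithmetic in the last two steps of this bound, namely that $\sum_{t=0}^{k}(k-t)e^{-a}$ equals $1-P^*$ for the chosen $a$, can be checked numerically. The short Python sketch below is an illustration added for that purpose; the function name `final_bound` is hypothetical, and the code checks only this arithmetic, not the probability inequalities that precede it.

```python
import math


def final_bound(k, p_star):
    """Evaluate sum_{t=0}^{k} (k - t) * exp(-a) with a = log(k(k+1)/(2(1-P*)))."""
    a = math.log(k * (k + 1) / (2.0 * (1.0 - p_star)))
    return sum((k - t) * math.exp(-a) for t in range(k + 1))


# The bound should reproduce 1 - P* for any k and P*.
checks = {(k, p): final_bound(k, p) for k in (3, 14) for p in (0.7, 0.9)}
```

For example, `final_bound(14, 0.9)` evaluates to 0.1 up to floating-point rounding.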


3.  APPLICATIONS  TO  SELECTION  OF  "GOOD" 
OR  "SUPERIOR"  REGRESSION  EQUATIONS 


Assume the standard linear model

$$Y = X\beta + \varepsilon \tag{3.1}$$

where $X$ is an $n \times p$ known matrix of rank $p \le n$, $\beta$ is a $p \times 1$ parameter
vector, and $\varepsilon \sim N(0, \sigma_0^2 I_n)$. Consider the models, for any $r$,
$2 \le r \le p-1$,

$$Y = X_{ri}\beta_{ri} + \varepsilon_{ri} \tag{3.2}$$

where $X_{ri}$ is an $n \times r$ matrix of rank $r$ with $X_1 = [1, \ldots, 1]'$ as its first column, $\beta_{ri}$
is an $r \times 1$ parameter vector, and $\varepsilon_{ri} \sim N(0, \sigma_{ri}^2 I_n)$, where $i = 1, \ldots, k_r =
\binom{p-1}{r-1}$. Let $k = \sum_{r=2}^{p-1} k_r$. The goal is to include all the designs $X_{ri}$
(or sets of independent variables) associated with $\sigma_{[j]}^2$, $j =
1, \ldots, k-t$.
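The count $k = \sum_{r=2}^{p-1}\binom{p-1}{r-1}$ can be verified by listing the candidate designs directly. The Python sketch below is an added illustration; the function name `candidate_subsets` is hypothetical, and each design is represented by the tuple of column indices it uses, with index 1 (the column of ones) always included.

```python
from itertools import combinations
from math import comb


def candidate_subsets(p):
    """All column-index sets containing 1 that use r columns, 2 <= r <= p - 1."""
    subsets = []
    for r in range(2, p):
        # Choose the other r - 1 columns from X_2, ..., X_p.
        for rest in combinations(range(2, p + 1), r - 1):
            subsets.append((1,) + rest)
    return subsets


subsets = candidate_subsets(5)
k = len(subsets)
k_r = {r: sum(1 for s in subsets if len(s) == r) for r in range(2, 5)}
```

For $p = 5$ this gives $k_2 = 4$, $k_3 = 6$, $k_4 = 4$ and $k = 14 = 2^{4} - 2$.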

Note that for any $r$, $2 \le r \le p-1$, if

$$SS_{ri} = Y'\left(I - X_{ri}(X_{ri}'X_{ri})^{-1}X_{ri}'\right)Y = Y'Q_{ri}Y,$$

then following Searle (1972, p. 57)

$$SS_{ri}/\sigma_0^2 \sim \chi^2\{\nu_r,\ (X\beta)'Q_{ri}(X\beta)/(2\sigma_0^2)\},$$

where $\nu_r = n-r$, for $1 \le i \le k_r$. Note that the noncentrality
parameter is not zero in general and that

$$\sigma_{ri}^2 = \sigma_0^2 + (X\beta)'Q_{ri}(X\beta)/\nu_r.$$
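To make the quadratic form $Y'Q_{ri}Y$ concrete: it is the residual sum of squares from the least-squares fit of $Y$ on the columns of $X_{ri}$. The Python sketch below is an added illustration using only the standard library; the function names are hypothetical, and solving the normal equations by Gauss-Jordan elimination is a choice made here for self-containment, not a method stated in the paper.

```python
def residual_ss(columns, y):
    """Residual sum of squares Y'QY for the design with the given columns.

    `columns` is a list of r columns, each a list of n numbers; the design
    is assumed to have full column rank.
    """
    n, r = len(y), len(columns)
    # Normal equations (X'X) b = X'y.
    xtx = [[sum(ci[t] * cj[t] for t in range(n)) for cj in columns] for ci in columns]
    xty = [sum(c[t] * y[t] for t in range(n)) for c in columns]
    aug = [row[:] + [rhs] for row, rhs in zip(xtx, xty)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(r):
        pivot = max(range(col, r), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(r):
            if i != col:
                f = aug[i][col] / aug[col][col]
                aug[i] = [u - f * v for u, v in zip(aug[i], aug[col])]
    beta = [aug[i][r] / aug[i][i] for i in range(r)]
    fitted = [sum(b * c[t] for b, c in zip(beta, columns)) for t in range(n)]
    return sum((yt - ft) ** 2 for yt, ft in zip(y, fitted))


def mean_square(columns, y):
    """SS / nu with nu = n - r, the statistic that replaces s^2 in Section 3."""
    return residual_ss(columns, y) / (len(y) - len(columns))


ones = [1.0, 1.0, 1.0, 1.0]
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0, 4.0]                 # made-up data lying exactly on a line
ss_intercept_only = residual_ss([ones], y)
ss_line = residual_ss([ones, x], y)
```

With this made-up data the intercept-only fit leaves a residual sum of squares of 5, while adding the column `x` fits the data exactly.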


If $\sigma_0^2$ is not equal to 1, then we consider the linear model $Y/\sigma_0 =
X\beta/\sigma_0 + \varepsilon/\sigma_0$, $\varepsilon/\sigma_0 \sim N(0, I_n)$. Thus we assume without loss of
generality that $\sigma_0 = 1$.

We know that the non-central $\chi^2$ distribution with non-centrality
parameter $\lambda$ has monotone likelihood ratio in $\lambda$. Hence the
monotonicity of (2.2) is satisfied. We can apply the sequential
procedure $R$ to select superior regression equations by replacing $s_{in}^2$
by $SS_{ri}/\nu_r$.


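Let $U_{ri} = SS_{ri}/\nu_r$. The probability density of $U_{ri}$ is

$$g_{\sigma_{ri}^2}(U_{ri}) = \sum_{k=0}^{\infty} \frac{e^{-\lambda}\lambda^k}{k!} \cdot \frac{\nu_r\,(\nu_r U_{ri})^{\frac{1}{2}\nu_r + k - 1}\, e^{-\frac{1}{2}\nu_r U_{ri}}}{2^{\frac{1}{2}\nu_r + k}\,\Gamma\!\left(\frac{1}{2}\nu_r + k\right)}$$

where $\sigma_{ri}^2 = 1 + (X\beta)'Q_{ri}(X\beta)/\nu_r$, $\nu_r = n-r$ and $\lambda = (X\beta)'Q_{ri}(X\beta)/2$.
If $\sigma_{ri}^2 = 1$, then $\lambda = 0$, and if $\sigma_{ri}^2 = \Delta$, then $\lambda = (\Delta-1)\nu_r/2$. Hence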


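$$\frac{g_{\Delta}(U_{ri})}{g_{1}(U_{ri})} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} \left[\frac{\nu_r U_{ri}}{2}\right]^k \frac{\Gamma\!\left(\frac{1}{2}\nu_r\right)}{\Gamma\!\left(\frac{1}{2}\nu_r + k\right)} \tag{4.1}$$

where $\lambda = (\Delta-1)\nu_r/2$. Let

$$a_k = \frac{e^{-\lambda}\lambda^k}{k!} \left[\frac{\nu_r U_{ri}}{2}\right]^k \frac{\Gamma\!\left(\frac{1}{2}\nu_r\right)}{\Gamma\!\left(\frac{1}{2}\nu_r + k\right)}, \qquad k = 0, 1, 2, \ldots$$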


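Since

$$\frac{a_{k+1}}{a_k} = \lambda \left[\frac{\nu_r U_{ri}}{2}\right] \frac{1}{(k+1)\left(\frac{1}{2}\nu_r + k\right)} \to 0 \quad \text{as } k \to \infty,$$

then for any $0 < \alpha < 1$, there exists $q$ such that

$$\frac{a_{k+1}}{a_k} < \alpha < 1, \quad \text{for all } k > q.$$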

Let us consider the error due to the truncation of the series
in (4.1). Let $q$ be the number of terms in the truncated series.
Then the error due to truncation of the series in (4.1) is given
by

$$\sum_{k=q}^{\infty} a_k = \frac{g_{\Delta}(U_{ri})}{g_{1}(U_{ri})} - \sum_{k=0}^{q-1} a_k.$$

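Given $\eta > 0$, let $k_0$ be the smallest positive integer $k$ such that

$$\frac{a_{k+1}}{a_k} < 1 \quad \text{and} \quad \frac{a_k}{1 - a_{k+1}/a_k} < \eta.$$

For this $k_0$, since the ratios $a_{k+1}/a_k$ decrease in $k$, it is easy to prove that

$$0 < \frac{g_{\Delta}(U_{ri})}{g_{1}(U_{ri})} - \sum_{k=0}^{k_0-1} a_k = \sum_{k=k_0}^{\infty} a_k \le \sum_{k=0}^{\infty} a_{k_0} \left(\frac{a_{k_0+1}}{a_{k_0}}\right)^{k} < \eta.$$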


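Thus $g_{\Delta}(U_{ri})/g_{1}(U_{ri}) \approx \sum_{k=0}^{k_0-1} a_k$ with error less than $\eta$. Because
$a_{k+1}/a_k \to 0$, the terms eventually decay faster than any geometric
sequence, so $g_{\Delta}(U_{ri})/g_{1}(U_{ri})$ can be evaluated efficiently.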
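The truncated evaluation of (4.1) can be sketched in Python as below. This is an added illustration under stated assumptions: the function name `likelihood_ratio` is hypothetical, the terms are generated by the recurrence $a_{k+1} = a_k \cdot \lambda[\nu_r U_{ri}/2]/\{(k+1)(\frac{1}{2}\nu_r + k)\}$, and the stopping rule is the geometric tail bound $a_k/(1 - a_{k+1}/a_k) < \eta$, which is one sufficient condition for a truncation error below $\eta$ and may differ in detail from the rule used by the authors.

```python
import math


def likelihood_ratio(u, nu, delta, eta):
    """Approximate g_Delta(u) / g_1(u) from (4.1) with truncation error below eta.

    Returns (partial sum a_0 + ... + a_{k0-1}, k0).
    """
    lam = (delta - 1.0) * nu / 2.0
    x = nu * u / 2.0
    term = math.exp(-lam)                    # a_0
    total = 0.0
    k = 0
    while True:
        ratio = lam * x / ((k + 1) * (nu / 2.0 + k))   # a_{k+1} / a_k
        # The ratios decrease in k, so the tail from a_k on is at most
        # the geometric sum a_k / (1 - ratio).
        if k >= 1 and ratio < 1.0 and term / (1.0 - ratio) < eta:
            return total, k
        total += term
        term *= ratio
        k += 1


# Check against a case with a known closed form: for nu = 1 the series
# sums to exp(-lam) * cosh(2 * sqrt(lam * u / 2)).
exact = math.exp(-1.0) * math.cosh(2.0)      # nu = 1, delta = 3, u = 2
fine, _ = likelihood_ratio(2.0, 1, 3.0, 1e-9)
coarse, k0 = likelihood_ratio(2.0, 1, 3.0, 0.1)
```

In this check case a tolerance of $\eta = 0.1$ is met after three terms, which illustrates how quickly the series settles when $\lambda\,\nu_r U_{ri}/2$ is moderate.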


5.  EXAMPLE 

In this section we present an example which will serve to
illustrate the sequential subset selection procedure. The data
set is taken from Neter and Wasserman (1974, p. 373), who used it
to illustrate several methods of finding a "best" set of
independent variables.

There are $n = 55$ observations on $p = 5$ independent variables.
Then $k = 2^{p-1} - 2 = 2^4 - 2 = 14$. For the subset selection procedure $R$, two
constants $\Delta$ and $P^*$ with $\Delta > 1$, $1 > P^* > 0$, are specified and we
wish to select a subset containing all superior regression
equations with probability at least $P^*$.

Begin by taking $n_1$ $(\ge 5)$ independent observations. Calculate
the values of the $k$ ratios $g_{\Delta}(U_{ri})/g_{1}(U_{ri})$ with error $\eta$ (specified).
For any $r$, $i$, if

$$g_{\Delta}(U_{ri})/g_{1}(U_{ri}) \ge b,$$

where $b = k(k+1)/[2(1-P^*)]$, we eliminate the regression equation
from further consideration. We proceed to the second stage by
taking $n_2 - n_1$ independent observations on each of the remaining
regression equations. The ratios are again computed and the same
elimination rule is used. We continue in this manner until the
elimination is stopped, at which time the procedure is terminated
with the declaration that the remaining regression equations are
the superior regression equations.
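For the $k = 14$ candidate equations of this example, the elimination threshold $b$ can be computed directly. The Python sketch below is an added illustration; the function name `threshold_b` is hypothetical.

```python
def threshold_b(k, p_star):
    """b = k(k+1) / (2(1 - P*)), the threshold for the ratio g_Delta / g_1."""
    return k * (k + 1) / (2.0 * (1.0 - p_star))


# Thresholds for the three values of P* used in the example.
thresholds = {p: threshold_b(14, p) for p in (0.7, 0.8, 0.9)}
```

This gives $b = 350$, $525$ and $1050$ for $P^* = 0.7$, $0.8$ and $0.9$ respectively, up to floating-point rounding.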

Let $\eta = 0.1$. Relative to the constant $b$, an error of $\eta = 0.1$ in
the value of $g_{\Delta}(U_{ri})/g_{1}(U_{ri})$ is small enough. Tables II-VII
list the subsets of independent variables eliminated at each stage by
the sequential subset selection procedure $R$.

In Table III, we consider $\Delta = 1.2$. If $P^* = 0.9$, then the procedure
$R$ eliminates $(X_1, X_5)$, $(X_1, X_4)$ and $(X_1, X_3)$ at stage 1
($n_1 = 11$); eliminates $(X_1, X_4, X_5)$ at stage 2 ($n_2 = 16$); and
eliminates $(X_1, X_3, X_4)$ at stage 3 ($n_3 = 21$). No subset is eliminated
at stage 4 ($n_4 = 26$). Thus the procedure is terminated. $(X_1, X_2)$,
$(X_1, X_2, X_3)$, $(X_1, X_2, X_4)$, $(X_1, X_2, X_5)$, $(X_1, X_3, X_5)$,
$(X_1, X_2, X_3, X_4)$, $(X_1, X_2, X_3, X_5)$, $(X_1, X_2, X_4, X_5)$ and $(X_1, X_3, X_4, X_5)$
are the sets of variables of the superior regression equations. We can
use the $C_p$ statistic to select one good regression equation among
the set of superior regression equations. For this example,
$(X_1, X_2, X_{?})$ is the set of variables of a good regression equation
(cf. Neter and Wasserman (1974)). Tables II-VII present the
results for $\Delta = 1.1$, 1.2, 1.5, 2, 3 and 5; $P^* = 0.7$, 0.8 and 0.9.
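The bookkeeping for the $\Delta = 1.2$, $P^* = 0.9$ case can be replayed in a few lines of Python. This added sketch only re-traces the eliminations reported in the text (stage sample sizes 11, 16, 21 and 26 as in Table III); it does not recompute any likelihood ratios, and the variable names are hypothetical.

```python
from itertools import combinations

# The 14 candidate subsets: X_1 plus one, two or three of X_2, ..., X_5.
all_subsets = [(1,) + rest
               for r in range(2, 5)
               for rest in combinations(range(2, 6), r - 1)]

# Eliminations reported for Delta = 1.2, P* = 0.9, keyed by sample size.
eliminated = {
    11: [(1, 5), (1, 4), (1, 3)],
    16: [(1, 4, 5)],
    21: [(1, 3, 4)],
    26: [],                      # no rejection, so the procedure stops
}

dropped = {s for stage in eliminated.values() for s in stage}
remaining = [s for s in all_subsets if s not in dropped]
```

Five of the fourteen subsets are eliminated, leaving the nine subsets listed in the text as the superior regression equations.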

TABLE  II 

$\eta = 0.1$, $\Delta = 1.1$.


0.7  ~  ' '0.4770737 

0,5) 

0.8"  ■075)707) 
0.9 . 075)70737 


0  ,4,6)  ,(1,3.4)  no  rejection _ 

"0  ,375) ,( 1 ,3,4)  no  rejection  _ _ 

0  0) _ r _ , _ .  _ 

0,4,5)771.40  (1,3,4)  no  rejection 


TABLE III

$\eta = 0.1$, $\Delta = 1.2$.

P*     n = 11                          n = 16          n = 21          n = 26
0.7    (1,4,5), (1,5), (1,4), (1,3)    no rejection    -               -
0.8    (1,4,5), (1,5), (1,4), (1,3)    no rejection    -               -
0.9    (1,5), (1,4), (1,3)             (1,4,5)         (1,3,4)         no rejection

TABLE IV

$\eta = 0.1$, $\Delta = 1.5$.

P*     n = 6           n = 11                             n = 16
0.7    (1,4), (1,3)    (1,4,5), (1,5), (1,3,4)            no rejection
0.8    (1,3)           (1,4,5), (1,5), (1,3,4), (1,4)     no rejection
0.9    (1,3)           (1,4,5), (1,5), (1,3,4), (1,4)     no rejection

TABLE V

$\eta = 0.1$, $\Delta = 2$.

P*     n = 6                           n = 11              n = 16
0.7    (1,4,5), (1,5), (1,4), (1,3)    (1,3,4)             no rejection
0.8    (1,5), (1,4), (1,3)             (1,4,5), (1,3,4)    no rejection
0.9    (1,5), (1,4), (1,3)             (1,4,5), (1,3,4)    no rejection

TABLE VI

$\eta = 0.1$, $\Delta = 3$.

P*     n = 6                                    n = 11          n = 16
0.7    (1,4,5), (1,5), (1,3,4), (1,4), (1,3)    no rejection    -
0.8    (1,4,5), (1,5), (1,4), (1,3)             (1,3,4)         no rejection
0.9    (1,4,5), (1,5), (1,4), (1,3)             (1,3,4)         no rejection

TABLE VII

$\eta = 0.1$, $\Delta = 5$.

P*     n = 6                                    n = 11
0.7    (1,4,5), (1,5), (1,3,4), (1,4), (1,3)    no rejection
0.8    (1,4,5), (1,5), (1,3,4), (1,4), (1,3)    no rejection
0.9    (1,4,5), (1,5), (1,3,4), (1,4), (1,3)    no rejection